Investigating speech style specific pronunciation variation in large spoken language corpora

Authors

  • Christophe Van Bael
  • Henk van den Heuvel
  • Helmer Strik
Abstract

In the past, linguistic research was typically conducted on relatively small datasets that were specifically designed for the research at hand. Although many large spoken language corpora have since become available, their usefulness for linguistic research has not yet been fully established. The research reported in this paper was conducted to illustrate the potential of large multi-purpose spoken language corpora for linguistic research. We investigated whether phonetic regularities can be identified in different speech styles. To this end, a data-driven study was conducted on a large multi-purpose spoken language corpus that includes a manually corrected broad phonetic transcription of the data. Our results show that speech style specific pronunciation processes can indeed be found in such a large corpus. This indicates that large multi-purpose spoken language corpora can contribute to linguistic research, if only for the purpose of hypothesis generation and verification.

Similar resources

On the Usefulness of Large Spoken Language Corpora for Linguistic Research

In the past, fundamental linguistic research was typically conducted on small data sets that were handcrafted for the specific research at hand. However, from the eighties onwards, many large spoken language corpora have become available. This study investigates the usefulness of large multi-purpose spoken language corpora for fundamental linguistic research. A research task was designed in whi...

Automatic phonetic transcription of large speech corpora

This study is aimed at investigating whether automatic phonetic transcription procedures can approximate manual transcriptions typically delivered with contemporary large speech corpora. To this end, ten automatic procedures were used to generate a broad phonetic transcription of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus. T...

Gender in everyday speech and language: a corpus-based study

This paper presents an exploratory study on the relations between gender and everyday parlance. A “data-mining” approach is used to explore gender-specific characteristics in a large number of spontaneous telephone and face-to-face conversations. Our study focuses on speech rate (speaking rate and articulation rate), disfluencies (filled pauses and repetitions), pronunciation variation (phoneme...

Analyzing and identifying multiword expressions in spoken language

The present paper investigates multiword expressions (MWEs) in spoken language and possible ways of identifying MWEs automatically in speech corpora. Two MWEs that emerged from previous studies and that occur frequently in Dutch are analyzed to study their pronunciation characteristics and compare them to those of other utterances in a large speech corpus. The analyses reveal that these MWEs ...

Multiword expressions in spoken language: An exploratory study on pronunciation variation

The study presented in this paper was aimed at exploring the possibilities of modelling specific pronunciation characteristics of multiword expressions (MWEs) for both automatic speech recognition (ASR) and automatic phonetic transcription (APT). For this purpose, we first drew up an inventory of frequently found N-grams extracted from orthographic transcriptions of spontaneous speech contained...

Publication date: 2004